Clustering huge protein sequence sets in linear time
Similar resources
Projected Clustering for Huge Data Sets in MapReduce
Fast-growing data sets with a very high number of attributes have become common in social, industrial and scientific areas. A meaningful analysis of these data sets requires sophisticated data mining techniques, such as projected clustering, that are able to deal with such complex data. In this work, we investigate solutions for extending the state-of-the-art projected clustering algorithm P3C for...
Clustering huge data sets for parametric PET imaging.
A new preprocessing clustering technique for quantification of kinetic PET data is presented. A two-stage clustering process, which combines a precluster and a classic hierarchical cluster analysis, provides data which are clustered according to a distance measure between time activity curves (TACs). The resulting clustered mean TACs can be used directly for estimation of kinetic parameters at ...
Clustering sequence sets for motif discovery
Most existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabli...
Exact clustering in linear time
The time complexity of data clustering has been viewed as fundamentally quadratic, slowing with the number of data items, as each item is compared for similarity to preceding items. Clustering of large data sets has been infeasible without resorting to probabilistic methods or to capping the number of clusters. Here we introduce MIMOSA, a novel class of algorithms which achieve linear-time comp...
Exact Subspace Clustering in Linear Time
Subspace clustering is an important unsupervised learning problem with wide applications in computer vision and data analysis. However, the state-of-the-art methods for this problem suffer from high time complexity, quadratic or cubic in n (the number of data instances). In this paper we exploit a data selection algorithm to speed up computation and the robust principal component analysis to stre...
Journal
Journal title: Nature Communications
Year: 2018
ISSN: 2041-1723
DOI: 10.1038/s41467-018-04964-5